Add Shrutam-2 contrib model: multilingual Indic ASR on Neuron#142

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/shrutam-2

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Three-stage ASR pipeline for Shrutam-2 (bharatgenai/Shrutam-2), supporting 12 Indian languages on Trainium2. The pipeline consists of:

  1. Conformer encoder (607.7M params, 24 layers) — traced via torch_neuronx.trace(), 9ms latency for 10s audio
  2. SMEAR-MoE projector (50.4M params, 8 experts) — traced via torch_neuronx.trace(), 1.6ms latency
  3. LLM decoder (1.2B LlamaForCausalLM) — compiled via NxD Inference ImageToTextModelWrapper with audio embedding scatter, 113 tok/s

End-to-end: 20.8 audio-seconds/s single-core, 61.1 audio-seconds/s with DP=4 on trn2.3xlarge. WER delta vs CPU: +1.3% (18/20 FLEURS samples).
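The three-stage data flow above can be sketched as a simple orchestration. This is an illustrative stub only, not the actual `modeling_shrutam2.py` implementation: the stage functions stand in for the traced Conformer encoder, the traced SMEAR-MoE projector, and the NxDI-compiled decoder.

```python
# Illustrative sketch of the three-stage pipeline; each stage function is a
# stub standing in for a traced/compiled Neuron artifact.

def encode(audio_frames):
    # Stand-in for the traced Conformer encoder: audio frames -> acoustic features.
    return [[f * 0.5] for f in audio_frames]

def project(features):
    # Stand-in for the traced SMEAR-MoE projector: features -> LLM-space embeddings.
    return [[x * 2.0 for x in feat] for feat in features]

def decode(embeddings):
    # Stand-in for the NxDI LlamaForCausalLM decoder: embeddings -> token ids.
    return list(range(len(embeddings)))

def run_pipeline(audio_frames):
    features = encode(audio_frames)
    embeddings = project(features)
    return decode(embeddings)

tokens = run_pipeline([0.1, 0.2, 0.3])
```

In the real pipeline, the first two stages are ahead-of-time traced with `torch_neuronx.trace()` and the decoder is compiled through NxD Inference, but the call order is the same.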

Model Information

Model Name: Shrutam-2 (bharatgenai/Shrutam-2)

Model Architecture: Conformer encoder + SMEAR-MoE projector + LlamaForCausalLM decoder

Purpose: Multilingual automatic speech recognition for 12 Indian languages (Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Odia, Punjabi, Assamese, Urdu)

Checklist

Required Components

  • Accuracy Test (e.g., test/integration/test_model.py)

    • 9 integration tests across 4 test classes (encoder, SMEAR, LLM, E2E pipeline)
    • Encoder/SMEAR: cosine similarity > 0.99 vs CPU reference
    • LLM: text generation quality and throughput validation
    • E2E: synthetic audio, real FLEURS audio, and throughput tests
    • All 9 tests pass on trn2.3xlarge
  • README.md with the following sections:

    • Usage Example: Three-step setup (trace encoder/SMEAR, compile LLM, run pipeline)
    • Compatibility Matrix: trn2.3xlarge SDK 2.29 validated
    • Example Checkpoints: bharatgenai/Shrutam-2
    • Testing Instructions: Full pytest commands with environment variables
  • Source Code (src/)

    • modeling_shrutam2.py: Complete pipeline implementation including Conformer encoder, SMEAR-MoE projector, NxDI LLM wrapper, Shrutam2Pipeline class, and trace/compile utilities
    • Follows BioReason-Pro multimodal pipeline pattern

Optional Components

  • Unit Tests (CPU or Neuron-based)
  • vLLM Integration

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Shrutam-2/
  README.md
  /src
    __init__.py
    modeling_shrutam2.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All 9 tests executed on trn2.3xlarge (LNC=2, Neuron SDK 2.29, NxDI 0.9.x):

pytest test/integration/test_model.py -v --timeout=600

Test Results:

test_encoder_accuracy_neuron_allclose  PASSED  (cosine sim: 0.9985)
test_encoder_latency                   PASSED  (9ms)
test_smear_accuracy_neuron_allclose    PASSED  (cosine sim: ~1.0)
test_smear_latency                     PASSED  (1.6ms)
test_llm_text_generation               PASSED
test_llm_decode_throughput             PASSED  (113 tok/s)
test_pipeline_synthetic_audio          PASSED
test_pipeline_real_audio               PASSED  (Hindi FLEURS)
test_pipeline_throughput               PASSED  (>30 tok/s)
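The encoder/SMEAR accuracy tests compare Neuron output against a CPU reference using cosine similarity with a 0.99 threshold. A minimal, self-contained version of that check (helper name and sample values are hypothetical; the real tests live in `test/integration/test_model.py`):

```python
import math

def cosine_similarity(a, b):
    # Flat-vector cosine similarity between Neuron output and CPU reference.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cpu_out = [0.12, -0.5, 0.33, 0.9]
neuron_out = [0.121, -0.499, 0.329, 0.901]  # tiny BF16-style perturbation
sim = cosine_similarity(cpu_out, neuron_out)
assert sim > 0.99  # threshold used by the encoder/SMEAR accuracy tests
```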

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29
  • Instance Type(s): trn2.3xlarge (LNC=2)
  • PyTorch Version: 2.9
  • Python Version: 3.12.3

Additional Information

  • The model uses the Pixtral/ImageToText NxDI pattern to scatter audio embeddings into the LLM input sequence, similar to how vision models inject image embeddings
  • The Conformer encoder weights are FP32 and compiled with --auto-cast matmult for BF16 compute
  • repetition_penalty=1.3 is recommended for greedy decoding to avoid hallucination on the ~5% of samples that otherwise depend on beam search
  • The SMEAR-MoE projector uses utterance-level soft routing (no top-k gating), which causes einsum scaling issues at batch sizes > 1. For batched inference, run SMEAR at BS=1 in a loop.
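The ImageToText-style embedding scatter mentioned above can be illustrated with a small stand-in: audio embeddings replace the embeddings at placeholder-token positions in the LLM input sequence, in order. The placeholder id and function name here are hypothetical, not the actual NxDI API.

```python
# Hypothetical sketch of the ImageToText-style scatter: embeddings produced by
# the projector overwrite the placeholder-token embeddings in the LLM input.

AUDIO_PLACEHOLDER_ID = -1  # hypothetical placeholder token id

def scatter_audio_embeddings(input_ids, text_embeds, audio_embeds):
    """Replace embeddings at placeholder positions with audio embeddings, in order."""
    out = [e[:] for e in text_embeds]
    audio_iter = iter(audio_embeds)
    for pos, tok in enumerate(input_ids):
        if tok == AUDIO_PLACEHOLDER_ID:
            out[pos] = next(audio_iter)
    return out

input_ids = [5, AUDIO_PLACEHOLDER_ID, AUDIO_PLACEHOLDER_ID, 7]
text_embeds = [[0.0, 0.0]] * 4
audio_embeds = [[1.0, 1.0], [2.0, 2.0]]
merged = scatter_audio_embeddings(input_ids, text_embeds, audio_embeds)
```

For batched inference the same per-utterance view applies, which is consistent with running the SMEAR projector at BS=1 in a loop as noted above.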

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Three-stage ASR pipeline for 12 Indian languages:
- Conformer encoder (607.7M, traced via torch_neuronx.trace)
- SMEAR-MoE projector (50.4M, 8 experts, traced)
- LLM decoder (1.2B LlamaForCausalLM, NxDI ImageToTextModelWrapper)

Validated on trn2.3xlarge (SDK 2.29, LNC=2):
- 20.8 audio-seconds/s single-core, 61.1 audio-s/s DP=4
- +1.3% WER delta vs CPU (18/20 FLEURS samples)
- 113 tok/s LLM decode, 9ms encoder, 1.6ms SMEAR

The 24-layer Conformer with BF16 auto-cast produces large relative errors on near-zero elements after LayerNorm/attention, even when cosine similarity is >0.99 and the WER delta is only +1.3%. Removed the unreliable element-wise relative-error assertion; cosine similarity is the validated accuracy metric.
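A toy illustration of why the element-wise relative-error assertion was dropped (values invented for illustration): a small absolute BF16-style perturbation on a near-zero element produces an enormous relative error, while cosine similarity over the whole vector stays essentially 1.0.

```python
import math

cpu = [1.0, 0.5, 1e-6, -0.8]      # contains a near-zero element (post-LayerNorm)
neuron = [1.0, 0.5, 5e-4, -0.8]   # small absolute error on that element

# Element-wise relative error blows up on the near-zero element...
rel_err = max(abs(c - n) / max(abs(c), 1e-30) for c, n in zip(cpu, neuron))
assert rel_err > 100  # ~499x relative error on the third element

# ...while cosine similarity (the retained metric) stays essentially 1.0.
dot = sum(c * n for c, n in zip(cpu, neuron))
cos = dot / (math.sqrt(sum(c * c for c in cpu)) * math.sqrt(sum(n * n for n in neuron)))
assert cos > 0.9999
```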